movie_raw <- read_excel(here("c25/data/movies_2023-10-24.xlsx"),
                        na = c("", "NA")) |> # otherwise only "" is recognized
  clean_names() |>
  type.convert(as.is = FALSE) |> # convert all characters to factors
  mutate(film_id = as.character(film_id),
         film = as.character(film))

movies <- movie_raw |>
  select(film_id, imdb_pct10, fc_pctwins, rt_audiencescore, ebert,
         box_off_mult, budget, metascore, bw_rating, imdb_oscars,
         mentions, dr_love, gen_1, bacon_1, lang_1,
         drama, comedy, adventure, action, romance, fantasy, sci_fi,
         crime, thriller, animation, family, mystery, biography, music,
         horror, musical, war, history, sport, western, film)

dim(movies)
[1] 201 36
Quick Check of Ingest
summary(movies)
film_id imdb_pct10 fc_pctwins rt_audiencescore
Length:201 Min. : 3.80 Min. :24 Min. :28.00
Class :character 1st Qu.:11.60 1st Qu.:42 1st Qu.:76.00
Mode :character Median :15.60 Median :52 Median :86.00
Mean :17.53 Mean :51 Mean :81.91
3rd Qu.:22.20 3rd Qu.:60 3rd Qu.:92.00
Max. :55.00 Max. :79 Max. :98.00
ebert box_off_mult budget metascore
Min. :1.000 Min. : 0.0013 Min. : 200000 Min. : 9.00
1st Qu.:2.875 1st Qu.: 2.6000 1st Qu.: 12000000 1st Qu.: 61.00
Median :3.500 Median : 4.7000 Median : 30000000 Median : 72.00
Mean :3.190 Mean : 8.5418 Mean : 59242257 Mean : 71.35
3rd Qu.:4.000 3rd Qu.: 9.3000 3rd Qu.: 90000000 3rd Qu.: 84.00
Max. :4.000 Max. :73.7000 Max. :356000000 Max. :100.00
NA's :25 NA's :20 NA's :19 NA's :10
bw_rating imdb_oscars mentions dr_love gen_1
Min. :0.000 Min. : 0.0000 Min. :1.000 No :124 F: 45
1st Qu.:1.000 1st Qu.: 0.0000 1st Qu.:1.000 Yes: 77 M:156
Median :3.000 Median : 0.0000 Median :1.000
Mean :2.135 Mean : 0.9849 Mean :1.249
3rd Qu.:3.000 3rd Qu.: 1.0000 3rd Qu.:1.000
Max. :3.000 Max. :11.0000 Max. :6.000
NA's :8 NA's :2
bacon_1 lang_1 drama comedy
Min. :1.000 English :177 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 Japanese: 7 1st Qu.:0.0000 1st Qu.:0.0000
Median :2.000 Hindi : 5 Median :1.0000 Median :0.0000
Mean :1.886 Italian : 2 Mean :0.5721 Mean :0.3582
3rd Qu.:2.000 Arabic : 1 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :3.000 ASL : 1 Max. :1.0000 Max. :1.0000
(Other) : 8
adventure action romance fantasy
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.3333 Mean :0.2537 Mean :0.1692 Mean :0.1393
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
sci_fi crime thriller animation
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
Mean :0.1244 Mean :0.1045 Mean :0.1045 Mean :0.08955
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
family mystery biography music
Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.0000 Median :0.00000 Median :0.00000
Mean :0.08955 Mean :0.0597 Mean :0.05473 Mean :0.05473
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000
horror musical war history
Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
Mean :0.0398 Mean :0.02985 Mean :0.0199 Mean :0.01493
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
sport western film
Min. :0.00000 Min. :0.000000 Length:201
1st Qu.:0.00000 1st Qu.:0.000000 Class :character
Median :0.00000 Median :0.000000 Mode :character
Mean :0.01493 Mean :0.004975
3rd Qu.:0.00000 3rd Qu.:0.000000
Max. :1.00000 Max. :1.000000
Data Cleaning
Convert budget so that it is expressed in millions of US dollars.
Create lang_eng as a 1/0 indicator: 1 for English (n = 177), 0 for non-English.
movies <- movies |>
  mutate(budget = budget / 1000000,
         lang_eng = as.numeric(lang_1 == "English"))

favstats(~ budget, data = movies) |>
  gt() |>
  fmt_number(columns = mean:sd, decimals = 2)
min   Q1   median   Q3   max   mean    sd      n     missing
0.2   12   30       90   356   59.24   68.02   182   19
movies |> tabyl(lang_eng, lang_1) |> gt()
lang_eng  Arabic  ASL  Bengali  Danish  English  French  German  Hindi  Italian  Japanese  Mandarin  Norwegian  Persian  Spanish
0         1       1    1        1       0        1       1       5      2        7         1         1          1        1
1         0       0    0        0       177      0       0       0      0        0         0         0          0        0
Which outcome shall we choose?
We’re interested in a percentage measure (0-100) addressing how beloved the movie is, according to an audience.
Variable           NA   Description
imdb_pct10         0    % of 10-star public ratings in IMDB as of 2023-09
fc_pctwins         0    % of matchups won on Flickchart as of 2023-10
rt_audiencescore   0    Rotten Tomatoes Audience Score (% Fresh) as of 2023-10
top five genres: drama, comedy, adventure, action, romance
How many predictors can we use?
If we fit a linear regression model with at most 201 observations (remember, some variables have missing values), how many predictors can we realistically include?
Important
A useful starting strategy when you’re not doing variable selection is that you need at least 15 observations for each coefficient you will estimate, including the intercept.
See Frank Harrell, Biostatistics for Biomedical Research (https://hbiostat.org/bbr/) for more on this topic.
How Many Predictors (at maximum)?
The model will run, so long as you have more observations than coefficients, but that’s not a good standard to use.
Bigger samples are better, but sample size is often determined by pragmatic considerations.
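At 15 observations per coefficient, our 201 observations support at most 201/15 ≈ 13.4 coefficients, so 13 is really a maximum. We’d like to avoid fitting more than perhaps 10 coefficients (including the intercept).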
Each quantitative predictor requires one coefficient
Each binary predictor also requires one coefficient
When treated as multi-categorical, a factor with k levels requires k-1 coefficients
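The counting rules above can be checked in a few lines of base R. This is only a sketch: the `toy` data frame and its column names are hypothetical stand-ins (not the movies data), and the 15-observations-per-coefficient figure is the rule of thumb quoted earlier, not a hard limit.

```r
# Rule of thumb from the text: at least 15 observations per coefficient
n_obs <- 201
max_coefs <- floor(n_obs / 15)
max_coefs

# Hypothetical toy data: one quantitative predictor, one binary predictor,
# and one factor with k = 4 levels
toy <- data.frame(
  x_quant = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.9, 3.3, 1.8),
  x_bin   = c(0, 1, 0, 1, 1, 0, 1, 0),
  x_cat   = factor(c("a", "b", "c", "d", "a", "b", "c", "d"))
)

# model.matrix() builds the design matrix for a formula; its number of
# columns is the number of coefficients (intercept included) that a
# linear model with this formula would estimate
n_coefs <- ncol(model.matrix(~ x_quant + x_bin + x_cat, data = toy))
n_coefs  # intercept + 1 (quant) + 1 (binary) + (4 - 1) (factor) = 6
```

Counting columns of the design matrix before fitting is a quick way to see whether a planned model stays within the coefficient budget.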
Second Cut: 9 predictors
Variable           Type     Description
imdb_pct10         Quant    % of 10-star public ratings in IMDB
rt_audiencescore   Quant    Rotten Tomatoes Audience Score (% Fresh)
box_off_mult       Quant    World Wide Gross Revenue (as multiple of budget)
metascore          Quant    Metascore (0-100 scale) from critic reviews
imdb_oscars        Quant    # of Oscar (Academy Award) wins
bw_rating          Quant    Bechdel-Wallace Test Criteria Met (0-3)
lang_eng           Binary   Is primary language English? (1 = Yes, 0 = No)
drama              Binary   Is drama listed in imdb_categories? (1 = Yes, 0 = No)
comedy             Binary   Is comedy listed in imdb_categories? (1 = Yes, 0 = No)
10 coefficients x 15 = 150 observations needed, at minimum. We have 201.
film_id fc_pctwins imdb_pct10 rt_audiencescore
Length:201 Min. :24 Min. : 3.80 Min. :28.00
Class :character 1st Qu.:42 1st Qu.:11.60 1st Qu.:76.00
Mode :character Median :52 Median :15.60 Median :86.00
Mean :51 Mean :17.53 Mean :81.91
3rd Qu.:60 3rd Qu.:22.20 3rd Qu.:92.00
Max. :79 Max. :55.00 Max. :98.00
box_off_mult metascore imdb_oscars bw_rating
Min. : 0.0013 Min. : 9.00 Min. : 0.0000 Min. :0.000
1st Qu.: 2.8000 1st Qu.: 61.00 1st Qu.: 0.0000 1st Qu.:1.000
Median : 4.7000 Median : 71.00 Median : 0.0000 Median :3.000
Mean : 8.5436 Mean : 71.11 Mean : 0.9751 Mean :2.124
3rd Qu.: 9.6000 3rd Qu.: 83.00 3rd Qu.: 1.0000 3rd Qu.:3.000
Max. :73.7000 Max. :100.00 Max. :11.0000 Max. :3.000
lang_eng drama comedy film
Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:201
1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
Median :1.0000 Median :1.0000 Median :0.0000 Mode :character
Mean :0.8806 Mean :0.5721 Mean :0.3582
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000
set.seed(431432)
ctrl <- trainControl(method = "cv", number = 10) ## caret package

## train our model on those 10 folds
i6_train <- train(fc_pctwins ~ imdb_pct10 + rt_audiencescore +
                    box_off_mult + metascore + imdb_oscars + bw_rating +
                    lang_eng + drama + comedy,
                  data = imp6, method = "lm", trControl = ctrl)
Summarize 10-fold cross-validation
i6_train
Linear Regression
201 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 181, 181, 181, 180, 181, 181, ...
Resampling results:
RMSE Rsquared MAE
7.586781 0.6480235 6.089882
Tuning parameter 'intercept' was held constant at a value of TRUE
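To see roughly what the 10-fold summary is averaging over, here is a minimal base-R sketch of k-fold cross-validation for a linear model. It is not the caret implementation: the data are simulated, and `sim`, `x1`, `x2`, and `y` are hypothetical names chosen only for illustration.

```r
set.seed(431432)

# Hypothetical simulated data standing in for the movies data
n <- 201
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 50 + 5 * sim$x1 - 3 * sim$x2 + rnorm(n, sd = 8)

# Randomly assign each row to one of 10 folds of near-equal size
k <- 10
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other 9 folds, predict the held-out fold,
# and record the root mean squared prediction error for that fold
rmse_by_fold <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x1 + x2, data = sim[fold != i, ])
  pred <- predict(fit, newdata = sim[fold == i, ])
  sqrt(mean((sim$y[fold == i] - pred)^2))
})

# Cross-validated RMSE: the average of the 10 held-out RMSE values
mean(rmse_by_fold)
```

With 201 rows and 10 folds, each held-out fold has 20 or 21 rows, which is why the training sets in the output above have 180 or 181 rows.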